Abstract
Sickle cell disease (SCD) is a genetic hemoglobinopathy affecting approximately 100,000 Americans, predominantly Black/African-American individuals. The most common genotype in the U.S. is homozygous sickle hemoglobin (HbSS). Over 2 million people are asymptomatic genetic carriers, a state known as sickle cell trait (SCT). Launched in 2018, the All of Us Research Program (AoURP) is a longitudinal observational public health study and biobank that includes more than 600,000 participants with initial data. In AoURP, participants with SCD and SCT can be ascertained by short-read whole-genome sequencing (srWGS) and/or by diagnosis codes (ICD9/10/SNOMED) from electronic health records (EHR). AoURP is a valuable resource for studying SCD, offering medical records across health systems, clinical labs, diagnoses, genomic data, and survey responses. In particular, AoURP is uniquely positioned to examine the predictive value of EHR diagnosis codes for ascertaining SCD.
We extracted genomic and diagnosis records for participants with relevant SCT and SCD data as follows: (a) srWGS: ≥1 HbS variant, or (b) EHR: ≥1 SCT or SCD diagnosis code. From the 633,546 AoURP participants, we identified a cohort of 6,942 participants with ≥1 HbS copy or ≥1 SCT or SCD diagnosis code. Seventy-four percent identified as Black/African-American, and 62% were female. According to srWGS, the genotypes identified were homozygous HbSS (2%), heterozygous HbSC (2%), heterozygous HbSβ+ thalassemia or HbSβ0 thalassemia (0.3%), SCT (88%), and unaffected or other hemoglobinopathy (8%). Relative to srWGS as the reference for identifying participants with any type of SCD, the criterion of at least 1 SNOMED SCD diagnosis code has moderate sensitivity (0.85, 95% CI: 0.81, 0.89), high specificity (0.92, 95% CI: 0.91, 0.93), and high negative predictive value (0.999, 95% CI: 0.99, 0.99), but low positive predictive value (0.31, 95% CI: 0.28, 0.34).
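As an illustration of how validity measures of this kind are derived, the following is a minimal sketch that computes sensitivity, specificity, and predictive values from a 2×2 table comparing a diagnosis-code criterion against a sequencing reference. The function names, the use of the Wilson score interval, and the counts are all assumptions made for illustration; the counts are hypothetical and are not the study's data, and the abstract does not state which confidence interval method was used.

```python
from math import sqrt


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def diagnostic_metrics(tp, fp, fn, tn):
    """Return (estimate, (ci_low, ci_high)) for each validity measure.

    tp: code present, reference positive    fp: code present, reference negative
    fn: code absent, reference positive     tn: code absent, reference negative
    """
    return {
        "sensitivity": (tp / (tp + fn), wilson_ci(tp, tp + fn)),
        "specificity": (tn / (tn + fp), wilson_ci(tn, tn + fp)),
        "ppv": (tp / (tp + fp), wilson_ci(tp, tp + fp)),
        "npv": (tn / (tn + fn), wilson_ci(tn, tn + fn)),
    }


# Hypothetical counts chosen only to show the calculation, not taken from AoURP.
metrics = diagnostic_metrics(tp=80, fp=120, fn=20, tn=1780)
for name, (estimate, (low, high)) in metrics.items():
    print(f"{name}: {estimate:.3f} (95% CI: {low:.3f}, {high:.3f})")
```

With these toy counts the reference-positive group is small relative to the reference-negative group, so even a modest false-positive rate produces more false positives than true positives. This is the general mechanism by which a criterion can combine high specificity and high negative predictive value with low positive predictive value.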
We demonstrate that ICD9/10 and SNOMED codes in AoURP can be compared against whole-genome sequencing data as part of an approach to identify participants with SCD and SCT. Although a simple diagnosis code-based query has low positive predictive value, it has excellent negative predictive value: the absence of a diagnosis code is highly reliable for ruling out SCD. Future work should aim to improve validity by testing more nuanced algorithms that use different counts and arrangements of ICD9/10 and SNOMED codes to identify SCD and SCT in this data source. Since srWGS is not available for all AoURP participants, a combined genomic and diagnosis code approach will be needed to maximize cohort size for longitudinal public health studies of SCD and SCT.